{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.10"
    },
    "colab": {
      "name": "10_Exploring_Quantum_Chemistry_with_GDB1k.ipynb",
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": true,
        "id": "Rqb9ef8F2UJW",
        "colab_type": "text"
      },
      "source": [
        "# Tutorial Part 10: Exploring Quantum Chemistry with GDB1k"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IcM5fm932UJY",
        "colab_type": "text"
      },
      "source": [
        "Most of the tutorials we've walked you through so far have focused on applications to the drug discovery realm, but DeepChem's tool suite works for molecular design problems generally. In this tutorial, we're going to walk through an example of how to train a simple molecular machine learning for the task of predicting the atomization energy of a molecule. (Remember that the atomization energy is the energy required to form 1 mol of gaseous atoms from 1 mol of the molecule in its standard state under standard conditions).\n",
        "\n",
        "## Colab\n",
        "\n",
        "This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/10_Exploring_Quantum_Chemistry_with_GDB1k.ipynb)\n",
        "\n",
        "## Setup\n",
        "\n",
        "To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hiRnnJpG2UJY",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 153
        },
        "outputId": "4ccce479-ab8f-4b55-a00b-9554d53d874d"
      },
      "source": [
        "!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py\n",
        "import conda_installer\n",
        "conda_installer.install()\n",
        "!/root/miniconda/bin/conda info -e"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
            "                                 Dload  Upload   Total   Spent    Left  Speed\n",
            "\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100  3489  100  3489    0     0  37923      0 --:--:-- --:--:-- --:--:-- 37923\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "all packages is already installed\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "# conda environments:\n",
            "#\n",
            "base                  *  /root/miniconda\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rqGp9hYVBUyQ",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 188
        },
        "outputId": "73b2f101-82a4-4299-a837-5b55c2e3a7a9"
      },
      "source": [
        "!pip install --pre deepchem\n",
        "import deepchem\n",
        "deepchem.__version__"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: deepchem in /usr/local/lib/python3.6/dist-packages (2.4.0rc1.dev20200805143010)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)\n",
            "Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)\n",
            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)\n",
            "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)\n",
            "Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'2.4.0-rc1.dev'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ub1J6G5w2UJd",
        "colab_type": "text"
      },
      "source": [
        "With our setup in place, let's do a few standard imports to get the ball rolling."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "19IsqJhx2UJe",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import os\n",
        "import unittest\n",
        "import numpy as np\n",
        "import deepchem as dc\n",
        "import numpy.random\n",
        "from deepchem.utils.evaluate import Evaluator\n",
        "from sklearn.ensemble import RandomForestRegressor\n",
        "from sklearn.kernel_ridge import KernelRidge"
      ],
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AssRCAgB2UJi",
        "colab_type": "text"
      },
      "source": [
        "The ntext step we want to do is load our dataset. We're using a small dataset we've prepared that's pulled out of the larger GDB benchmarks. The dataset contains the atomization energies for 1K small molecules."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "j5PUW7452UJi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tasks = [\"atomization_energy\"]\n",
        "dataset_file = \"../../datasets/gdb1k.sdf\"\n",
        "smiles_field = \"smiles\"\n",
        "mol_field = \"mol\""
      ],
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hs0RDgHN2UJm",
        "colab_type": "text"
      },
      "source": [
        "We now need a way to transform molecules that is useful for prediction of atomization energy. This representation draws on foundational work [1] that represents a molecule's 3D electrostatic structure as a 2D matrix $C$ of distances scaled by charges, where the $ij$-th element is represented by the following charge structure.\n",
        "\n",
        "$C_{ij} = \\frac{q_i q_j}{r_{ij}^2}$\n",
        "\n",
        "If you're observing carefully, you might ask, wait doesn't this mean that molecules with different numbers of atoms generate matrices of different sizes? In practice the trick to get around this is that the matrices are \"zero-padded.\" That is, if you're making coulomb matrices for a set of molecules, you pick a maximum number of atoms $N$, make the matrices $N\\times N$ and set to zero all the extra entries for this molecule. (There's a couple extra tricks that are done under the hood beyond this. Check out reference [1] or read the source code in DeepChem!)\n",
        "\n",
        "DeepChem has a built in featurization class `dc.feat.CoulombMatrixEig` that can generate these featurizations for you."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Yadcs27f2UJn",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "featurizer = dc.feat.CoulombMatrixEig(23, remove_hydrogens=False)"
      ],
      "execution_count": 11,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Z9BJKEmd2UJq",
        "colab_type": "text"
      },
      "source": [
        "Note that in this case, we set the maximum number of atoms to $N = 23$. Let's now load our dataset file into DeepChem. As in the previous tutorials, we use a `Loader` class, in particular `dc.data.SDFLoader` to load our `.sdf` file into DeepChem. The following snippet shows how we do this:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "t-OldF822UJr",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# loader = dc.data.SDFLoader(\n",
        "#       tasks=[\"atomization_energy\"], smiles_field=\"smiles\",\n",
        "#       featurizer=featurizer,\n",
        "#       mol_field=\"mol\")\n",
        "# dataset = loader.featurize(dataset_file)"
      ],
      "execution_count": 12,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gQ_zcAz92UJt",
        "colab_type": "text"
      },
      "source": [
        "For the purposes of this tutorial, we're going to do a random split of the dataset into training, validation, and test. In general, this split is weak and will considerably overestimate the accuracy of our models, but for now in this simple tutorial isn't a bad place to get started."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GNhuNAZY2UJu",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# random_splitter = dc.splits.RandomSplitter()\n",
        "# train_dataset, valid_dataset, test_dataset = random_splitter.train_valid_test_split(dataset)"
      ],
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7ouN5JxO2UJ0",
        "colab_type": "text"
      },
      "source": [
        "One issue that Coulomb matrix featurizations have is that the range of entries in the matrix $C$ can be large. The charge $q_1q_2/r^2$ term can range very widely. In general, a wide range of values for inputs can throw off learning for the neural network. For this, a common fix is to normalize the input values so that they fall into a more standard range. Recall that the normalization transform applies to each feature $X_i$ of datapoint $X$\n",
        "\n",
        "$\\hat{X_i} = \\frac{X_i - \\mu_i}{\\sigma_i}$\n",
        "\n",
        "where $\\mu_i$ and $\\sigma_i$ are the mean and standard deviation of the $i$-th feature. This transformation enables the learning to proceed smoothly. A second point is that the atomization energies also fall across a wide range. So we apply an analogous transformation normalization transformation to the output to scale the energies better. We use DeepChem's transformation API to make this happen:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eN7aqR042UJ0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# transformers = [\n",
        "#     dc.trans.NormalizationTransformer(transform_X=True, dataset=train_dataset),\n",
        "#     dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]\n",
        "\n",
        "# for dataset in [train_dataset, valid_dataset, test_dataset]:\n",
        "#   for transformer in transformers:\n",
        "#       dataset = transformer.transform(dataset)"
      ],
      "execution_count": 14,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": true,
        "id": "IerJqoXo2UJ5",
        "colab_type": "text"
      },
      "source": [
        "Now that we have the data cleanly transformed, let's do some simple machine learning. We'll start by constructing a random forest on top of the data. We'll use DeepChem's hyperparameter tuning module to do this."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "scrolled": true,
        "id": "UNG8EXtg2UJ6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# def rf_model_builder(model_params, model_dir):\n",
        "#   sklearn_model = RandomForestRegressor(**model_params)\n",
        "#   return dc.models.SklearnModel(sklearn_model, model_dir)\n",
        "# params_dict = {\n",
        "#     \"n_estimators\": [10, 100],\n",
        "#     \"max_features\": [\"auto\", \"sqrt\", \"log2\", None],\n",
        "# }\n",
        "\n",
        "# metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)\n",
        "# optimizer = dc.hyper.HyperparamOpt(rf_model_builder)\n",
        "# best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(\n",
        "#     params_dict, train_dataset, valid_dataset, transformers,\n",
        "#     metric=metric)"
      ],
      "execution_count": 15,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FdhT0zDD2UJ-",
        "colab_type": "text"
      },
      "source": [
        "Let's build one more model, a kernel ridge regression, on top of this raw data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LYTzmcyy2UJ-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# def krr_model_builder(model_params, model_dir):\n",
        "#   sklearn_model = KernelRidge(**model_params)\n",
        "#   return dc.models.SklearnModel(sklearn_model, model_dir)\n",
        "\n",
        "# params_dict = {\n",
        "#     \"kernel\": [\"laplacian\"],\n",
        "#     \"alpha\": [0.0001],\n",
        "#     \"gamma\": [0.0001]\n",
        "# }\n",
        "\n",
        "# metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)\n",
        "# optimizer = dc.hyper.HyperparamOpt(krr_model_builder)\n",
        "# best_krr, best_krr_hyperparams, all_krr_results = optimizer.hyperparam_search(\n",
        "#     params_dict, train_dataset, valid_dataset, transformers,\n",
        "#     metric=metric)"
      ],
      "execution_count": 16,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IS9JTDyi2UKD",
        "colab_type": "text"
      },
      "source": [
        "# Congratulations! Time to join the Community!\n",
        "\n",
        "Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:\n",
        "\n",
        "## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)\n",
        "This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.\n",
        "\n",
        "## Join the DeepChem Gitter\n",
        "The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!\n",
        "\n",
        "# Bibliography:\n",
        "\n",
        "[1] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.146401"
      ]
    }
  ]
}